Scope of coding in Olympics 2040
Group Codelympics: Mahir Dave, Parth Nanwani, Priyanka Chib, Smriti Gangwani, Vishwa Thakkar, Zaid Dar
The Olympic Games, also known as the Games of the Olympiad, are a major international multi-sport event normally held once every four years (since 1994 the Summer and Winter Games have alternated, so one of them takes place every two years). There are over 400 events in 40 sports across the Olympic programme. The Summer Olympics consist of 33 sports and the Winter Olympics of 7. Each event and its medal winners generate data. We analyze the data of multiple Olympic Games in Python, using different visualizations to draw inferences and find correlations. We combined the Olympics dataset with the National Olympic Committee (NOC) lookup table to build a dataframe of countries and their total medals.
The project uses a dataset covering the Olympic Games from 1896 to 2016. It includes physical statistics of the athletes, such as height, weight, and age, as well as event-related fields such as sport, medals won, and country. This dataset helps us analyze how the Olympics have evolved across sports and events, and how different countries and athletes of different genders have performed.
We retrieved two datasets that originate from www.sports-reference.com and were obtained via kaggle.com. The first (‘athlete_events.csv’) contains the athlete's name, physical characteristics (age, sex, weight, height), NOC (country code), assigned team, type of games, games participated in, season, event, and medal, for the years 1896 to 2016. The NOC column in this dataset identifies the National Olympic Committee. The second dataset (‘noc_regions.csv’) maps each NOC value to a region. To get the region for each athlete, we merge the two datasets on this common column, which lets us determine which athletes compete for which region. We also used web scraping on https://statisticstimes.com/economy/projected-world-gdp-ranking.php to find the GDP of the top countries.
The questions above help us determine which factors play a vital role in the performance of a country or athlete at the Olympics, while showing trends and interesting details that we can use to make inferences about the various fields of data. The analysis combines factors such as BMI, GDP, and even politics to look for what drives success for athletes from different regions. It aims not only to describe past outcomes but also to suggest how performance could be improved. Hence, this project focuses mainly on data analysis.
#Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import matplotlib
import seaborn as sns
#Importing the data into a DataFrame for cleaning and analysis
data = pd.read_csv('athlete_events.csv')
region_map = pd.read_csv('noc_regions.csv')
#Merging the two datasets on the common attribute NOC
data = pd.merge(data, region_map, on="NOC")
data.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN |
| 2 | 602 | Abudoureheman | M | 22.0 | 182.0 | 75.0 | China | CHN | 2000 Summer | 2000 | Summer | Sydney | Boxing | Boxing Men's Middleweight | NaN | China | NaN |
| 3 | 1463 | Ai Linuer | M | 25.0 | 160.0 | 62.0 | China | CHN | 2004 Summer | 2004 | Summer | Athina | Wrestling | Wrestling Men's Lightweight, Greco-Roman | NaN | China | NaN |
| 4 | 1464 | Ai Yanhan | F | 14.0 | 168.0 | 54.0 | China | CHN | 2016 Summer | 2016 | Summer | Rio de Janeiro | Swimming | Swimming Women's 200 metres Freestyle | NaN | China | NaN |
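The merge above uses pandas' default inner join, so any athlete row whose NOC code has no entry in the lookup table is dropped without notice. The sketch below shows one way to check for that, using a left join with `indicator=True`. The two small frames and the code `XXX` are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the two files, for illustration only.
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "NOC": ["CHN", "USA", "XXX"],  # "XXX" has no entry in the lookup table
})
regions = pd.DataFrame({
    "NOC": ["CHN", "USA"],
    "region": ["China", "USA"],
})

# A left join keeps every athlete row; indicator=True adds a "_merge" column
# that records whether a matching NOC was found in the lookup table.
checked = athletes.merge(regions, on="NOC", how="left", indicator=True)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "NOC"].unique())
print(unmatched)
```

Running the same check on the real data would list the NOC codes, if any, that the inner join discards.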
As the output shows, the acquired data contained a lot of noise. For example, the Team column specifies the team an athlete competes for in a particular event, but its values sometimes carry extra string elements, of which only the first is meaningful for processing. An athlete on team China, for instance, could have the value China-1 in this column, while the only relevant part is the team name, China. To clean the column, we split each value on the separator characters and keep only the first element.
#Cleaning the Team column: keep only the text before the first '-', '/', '(' or '#'
data['Team'] = data['Team'].str.split(r'-|/|\(|#').str[0]
data.drop('notes',axis=1,inplace=True)
data.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China |
| 2 | 602 | Abudoureheman | M | 22.0 | 182.0 | 75.0 | China | CHN | 2000 Summer | 2000 | Summer | Sydney | Boxing | Boxing Men's Middleweight | NaN | China |
| 3 | 1463 | Ai Linuer | M | 25.0 | 160.0 | 62.0 | China | CHN | 2004 Summer | 2004 | Summer | Athina | Wrestling | Wrestling Men's Lightweight, Greco-Roman | NaN | China |
| 4 | 1464 | Ai Yanhan | F | 14.0 | 168.0 | 54.0 | China | CHN | 2016 Summer | 2016 | Summer | Rio de Janeiro | Swimming | Swimming Women's 200 metres Freestyle | NaN | China |
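To make the effect of the split pattern concrete, here is a small sketch on a handful of sample strings. Only `China-1` comes from the description above; the other values are assumed examples, and the extra `.str.strip()` is an addition that removes any trailing space left behind by the split.

```python
import pandas as pd

# Sample team names; "China-1" is the case described above, the rest are assumed.
teams = pd.Series(["China-1", "Denmark/Sweden", "United States", "Germany-2"])

# Split on '-', '/', '(' or '#', keep the first piece, then trim whitespace.
cleaned = teams.str.split(r"-|/|\(|#").str[0].str.strip()
print(cleaned.tolist())  # ['China', 'Denmark', 'United States', 'Germany']
```

Names without any separator pass through unchanged, which is why the pattern is safe to apply to the whole column.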
The columns holding the athletes' biometrics (age, weight, and height) had many missing values. Because these columns are central to the analysis, the rows with missing values could not simply be dropped. To fill the NaN values, we grouped the data by Sport and Sex. A missing value, such as an athlete's weight, is likely to be close to the mean weight of the athletes in the same sport, but that mean can differ between genders, which is why both columns are used for grouping. Within each group we compute the mean of the required biometric and use it to replace the NaN values.
#Filling missing Age, Height and Weight with the mean of athletes of the same Sport and Sex
#transform returns a result aligned with the original rows, so it can be assigned back directly
data['Age']=data.groupby(by=["Sport","Sex"])['Age'].transform(lambda x: x.fillna(round(x.mean(),2)))
data['Height']=data.groupby(by=["Sport","Sex"])['Height'].transform(lambda x: x.fillna(round(x.mean(),2)))
data['Weight']=data.groupby(by=["Sport","Sex"])['Weight'].transform(lambda x: x.fillna(round(x.mean(),2)))
data.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China |
| 2 | 602 | Abudoureheman | M | 22.0 | 182.0 | 75.0 | China | CHN | 2000 Summer | 2000 | Summer | Sydney | Boxing | Boxing Men's Middleweight | NaN | China |
| 3 | 1463 | Ai Linuer | M | 25.0 | 160.0 | 62.0 | China | CHN | 2004 Summer | 2004 | Summer | Athina | Wrestling | Wrestling Men's Lightweight, Greco-Roman | NaN | China |
| 4 | 1464 | Ai Yanhan | F | 14.0 | 168.0 | 54.0 | China | CHN | 2016 Summer | 2016 | Summer | Rio de Janeiro | Swimming | Swimming Women's 200 metres Freestyle | NaN | China |
Despite assigning group means for weight, height, and age, some NaN values remained. In some sports only one athlete participated and that athlete's value was missing, so the Sport and Sex group had nothing to average and the fill changed nothing. Many rows were of this kind. For these cases, we assign the overall mean of the biometric, grouped by Sex only.
#Calculating the overall mean of each biometric per Sex to fill the remaining missing values
data['Height']=data.groupby(by="Sex")['Height'].transform(lambda x: x.fillna(round(x.mean(),2)))
data['Weight']=data.groupby(by="Sex")['Weight'].transform(lambda x: x.fillna(round(x.mean(),2)))
data.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China |
| 2 | 602 | Abudoureheman | M | 22.0 | 182.0 | 75.0 | China | CHN | 2000 Summer | 2000 | Summer | Sydney | Boxing | Boxing Men's Middleweight | NaN | China |
| 3 | 1463 | Ai Linuer | M | 25.0 | 160.0 | 62.0 | China | CHN | 2004 Summer | 2004 | Summer | Athina | Wrestling | Wrestling Men's Lightweight, Greco-Roman | NaN | China |
| 4 | 1464 | Ai Yanhan | F | 14.0 | 168.0 | 54.0 | China | CHN | 2016 Summer | 2016 | Summer | Rio de Janeiro | Swimming | Swimming Women's 200 metres Freestyle | NaN | China |
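The two-level fill described above (sport and sex mean first, then the sex-wide mean as a fallback) can be sketched on a toy frame as follows. The rows are made up for illustration; the Roque row stands in for a sport whose only athlete has no recorded weight.

```python
import numpy as np
import pandas as pd

# Made-up rows: the Roque group has no recorded weight at all.
df = pd.DataFrame({
    "Sport": ["Judo", "Judo", "Judo", "Roque", "Golf"],
    "Sex": ["M", "M", "M", "M", "M"],
    "Weight": [60.0, 70.0, np.nan, np.nan, 80.0],
})

# Step 1: fill from the mean of athletes in the same sport and of the same sex.
group_mean = df.groupby(["Sport", "Sex"])["Weight"].transform("mean")
df["Weight"] = df["Weight"].fillna(group_mean.round(2))

# Step 2: anything still missing falls back to the mean over the whole sex.
sex_mean = df.groupby("Sex")["Weight"].transform("mean")
df["Weight"] = df["Weight"].fillna(sex_mean.round(2))

print(df["Weight"].tolist())  # [60.0, 70.0, 65.0, 68.75, 80.0]
```

Note that the fallback mean in step 2 is computed after step 1, so it includes values that were themselves imputed; whether that is desirable is a modelling choice.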
# calculating BMI, which takes both weight (kg) and height (cm) into account, for better analysis
data['BMI']=data['Weight']/((data['Height']/100)**2)
data.reset_index(inplace=True)
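As a quick check of the BMI expression, the sketch below applies the same formula to the first athlete shown in the tables above (weight 80 kg, height 180 cm). The helper name `bmi` is an illustrative choice and not part of the notebook.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Weight and height of the first row in the tables above.
example = bmi(80.0, 180.0)
print(round(example, 2))  # 24.69
```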
Separating participants by season and grouping them by region to determine the number of medals each region has won in each season.
medal_list = ['Gold','Silver','Bronze']
#getting the data of all the participants who have won a medal
medals = data[data.Medal.isin(medal_list)]
#filtering the Summer Olympics data and getting the count of medals for each region
data_Sum = medals[medals['Season']=='Summer']
tmp_sum = data_Sum.groupby(['region', 'Season'])['Medal'].count().to_frame()
dfSum = tmp_sum.sort_values(by='Medal',ascending=False).reset_index()
dfSum.rename(columns={'Medal':'Number of Medals'},inplace=True)
dfSum.head()
| | region | Season | Number of Medals |
|---|---|---|---|
| 0 | USA | Summer | 5002 |
| 1 | Russia | Summer | 3188 |
| 2 | Germany | Summer | 3126 |
| 3 | UK | Summer | 1985 |
| 4 | France | Summer | 1627 |
#filtering the winter olympics data and getting the count of medals for each region
data_Wint = medals[medals['Season']=='Winter']
tmp_wint =data_Wint.groupby(['region', 'Season'])['Medal'].count().to_frame()
dfWint=tmp_wint.sort_values(by='Medal',ascending=False).reset_index()
dfWint.rename(columns={'Medal':'Number of Medals'},inplace=True)
dfWint.head()
| | region | Season | Number of Medals |
|---|---|---|---|
| 0 | Russia | Winter | 759 |
| 1 | USA | Winter | 635 |
| 2 | Germany | Winter | 630 |
| 3 | Canada | Winter | 611 |
| 4 | Norway | Winter | 443 |
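Because the dataset has one row per athlete per event, the counts above are medal rows rather than medal events: a team event contributes one row for every team member. The sketch below contrasts the two ways of counting on made-up rows. The deduplication key (region, Games, Event, Medal) is an assumption, and it would also merge the rare cases where one region legitimately wins two medals of the same type in one event.

```python
import pandas as pd

# Made-up medal rows: three members of one winning team plus one individual medal.
medal_rows = pd.DataFrame({
    "region": ["USA", "USA", "USA", "Russia"],
    "Games": ["1992 Summer"] * 4,
    "Event": ["Basketball Men's Basketball"] * 3 + ["Judo Men's Extra-Lightweight"],
    "Medal": ["Gold", "Gold", "Gold", "Bronze"],
})

# Counting rows, as in the cells above: one medal per athlete.
per_athlete = medal_rows.groupby("region")["Medal"].count()

# Counting each (region, Games, Event, Medal) combination once: one medal per event.
per_event = (
    medal_rows.drop_duplicates(subset=["region", "Games", "Event", "Medal"])
    .groupby("region")["Medal"]
    .count()
)

print(per_athlete.to_dict())  # {'Russia': 1, 'USA': 3}
print(per_event.to_dict())    # {'Russia': 1, 'USA': 1}
```

Either count can be reasonable; the per-athlete count favours regions that do well in team sports.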
#the function below plots a choropleth map; it takes 3 parameters: season_data, title1, color
#season_data is a dataframe that contains the number of medals of each region
#title1 is the title of the graph
#color represents the color scale for the map
def map_season(season_data,title1,color):
    trace = go.Choropleth(
        locations = season_data['region'],
        locationmode='country names',
        z = season_data['Number of Medals'],
        text = 'Number of Medals',
        autocolorscale =False,
        reversescale = False,
        colorscale = color,
        marker = dict(
            line = dict(
                color = 'black',
                width = 0.5)
        ),
        colorbar = dict(
            title = 'Number of Medals',
            tickprefix = ''),
        showscale=True,
        zmin=0
    )
    data = [trace]
    layout = go.Layout(
        title = title1,
        width=1000,
        height=500,
        geo = dict(
            showframe = True,
            showlakes = False,
            showcoastlines = True,
            projection = dict(
                type = 'natural earth'
            )
        )
    )
    fig = go.Figure(data=data, layout=layout)
    fig.show()
#calling the map function of winter olympics
map_season(dfWint,'Number of Medals per Country (Winter Olympic Games)','Blues')
The heat map of medals by country for the Winter season shows a clear trend. Countries with colder climates (Canada, Russia) perform much better than other countries, while countries close to the equator, which typically have warmer climates, perform much worse in the Winter Olympics. We can infer that countries with colder climates are able to practice and train for winter sports far more than countries with warm climates, which lets them perform better at the Winter Olympics. For example, athletes from Canada and Russia can train for skiing and snowboarding for much of the year, while athletes from warmer countries can only train for these sports when the weather allows them to. We can expect countries with colder climates to keep performing well in future Winter Olympic Games.
#calling the map function for the Summer Olympics
map_season(dfSum,'Number of Medals per Country (Summer Olympic Games)','Reds')
The heat map of medals by country for the Summer season shows a different trend from the Winter heat map. The top-performing countries (USA, Russia, Germany) have different climates, but they are all developed countries with large populations. One likely reason is that many popular Summer events (Gymnastics, Athletics, Swimming) can be practiced indoors, so athletes from countries with cold climates can still train for them. The infrastructure in developed countries is also much better than in underdeveloped countries, so athletes from developed countries have access to training facilities that help them outperform athletes from underdeveloped countries. We can expect developed countries with large populations to perform best in future Summer Olympic Games.
Using web scraping, we get the GDP of the top countries and merge it with the Olympics dataset to create a new dataframe containing the GDP and the number of medals for each country.
#getting GDP details from below url
html_url = 'https://statisticstimes.com/economy/projected-world-gdp-ranking.php'
#extracting the rank and GDP column using read_html
country_gdp = pd.read_html(html_url)[1]
rank = country_gdp['GDP (Nominal) (billions of $)'][['Rank','2021']]
country = country_gdp['Country/Economy'][['Country/Economy']]
#creating a dataframe containing rank and GDP
country_gdp = country.join(rank)
country_gdp.rename(columns={'2021':'GDP in Billions', 'Country/Economy': 'Country'},inplace=True)
#dropping the World GDP total
country_gdp.drop(index=50,inplace=True)
#converting rank from float to int
country_gdp['Rank'] = country_gdp['Rank'].astype(int)
country_gdp.head()
| | Country | Rank | GDP in Billions |
|---|---|---|---|
| 0 | Pakistan | 45 | 292.22 |
| 1 | United States | 1 | 22939.58 |
| 2 | China | 2 | 16862.98 |
| 3 | Japan | 3 | 5103.11 |
| 4 | Germany | 4 | 4230.17 |
list_na = []
list_region = []
#Checking whether each country with GDP is present in the Olympics dataframe and recording the ones that are missing
for idx,country in enumerate(country_gdp['Country']):
    if len(data[data['Team']==country]['NOC'].unique())==0:
        list_na.append(idx)
    list_region.append(list(data[data['Team']==country]['NOC'].unique()))
# due to differences in the names, a few NOCs are missing
# adding the NOCs of the missing countries
na_nocs = ['GBR','KOR','IRI','TPE','SIN','HKG']
dict_nocs = dict(zip(list_na,na_nocs))
for idx,noc in dict_nocs.items():
    list_region[idx].append(noc)
#adding the NOC column to the GDP dataframe
#note: only the first NOC found for each team name is kept, which may not be the country's own code
country_gdp['NOC'] = pd.DataFrame(list_region)[[0]]
#keeping only the countries whose GDP is present
#getting the medal count of each type per NOC
medals_total_count = data.loc[(data['NOC'].isin(list(country_gdp['NOC']))),['NOC','Medal']].value_counts().reset_index(name='Medals')
medals_total_count.rename(columns={'Medal':'Medal Type'},inplace=True)
#grouping data by country to get the total count of medals (summing only the numeric Medals column)
medals_total = medals_total_count.groupby('NOC')['Medals'].sum().reset_index()
#merging the total medal count and GDP based on NOC
medals_gdp_df = pd.merge(country_gdp,medals_total,how='left',on='NOC')
medals_gdp_df.sort_values(ascending=False,by='GDP in Billions',inplace=True)
medals_gdp_df.head()
| | Country | Rank | GDP in Billions | NOC | Medals |
|---|---|---|---|---|---|
| 1 | United States | 1 | 22939.58 | USA | 5637.0 |
| 2 | China | 2 | 16862.98 | CHN | 989.0 |
| 3 | Japan | 3 | 5103.11 | JPN | 913.0 |
| 4 | Germany | 4 | 4230.17 | USA | 5637.0 |
| 5 | United Kingdom | 5 | 3108.42 | GBR | 2068.0 |
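In the output above, the Germany row carries the NOC `USA` and the same medal total as the United States, which suggests that the first NOC found for a team name is not always the country's own code. One possible alternative, sketched below on made-up rows, is to take the most frequent NOC for each team name. This is an assumption about what a fix could look like and has not been run on the real data.

```python
import pandas as pd

# Made-up rows: one athlete listed under team "Germany" carries a different NOC.
rows = pd.DataFrame({
    "Team": ["Germany", "Germany", "Germany", "Germany", "Japan"],
    "NOC": ["USA", "GER", "GER", "GER", "JPN"],
})

# Taking the first NOC seen for each team can pick up the stray code.
first_seen = rows.groupby("Team")["NOC"].first()

# Taking the most frequent NOC for each team is more robust to stray rows.
most_common = rows.groupby("Team")["NOC"].agg(lambda s: s.value_counts().idxmax())

print(first_seen["Germany"])   # USA
print(most_common["Germany"])  # GER
```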
#Plotting the graph for the top 20 GDPs
fig, ax1 = plt.subplots(figsize=(12,6))
#plotting a line graph showing the top 20 countries with the highest GDP
sns.lineplot(data=medals_gdp_df.iloc[:20],y='GDP in Billions',x='Country', marker='o', sort = True, ax=ax1,color='black')
ax1.set_ylabel(ax1.get_ylabel(),size=12)
ax1.set_xlabel('Country',size=12)
#rotating the tick labels through tick_params avoids the FixedFormatter warning
ax1.tick_params(axis='x', labelrotation=45)
#plotting the bar graph showing the number of medals won by each of the above countries
ax2 = ax1.twinx()
plt_ax2=sns.barplot(data=medals_gdp_df.iloc[:20],y='Medals',x='Country', alpha=0.5, ax=ax2,color='blue')
plt_ax2.set_title('Countries with best GDP and Their Medals', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.show()
Many countries have started programs of higher sport funding to improve their athletes' performance. Two factors that might affect the performance of any country are its population and its GDP. Countries with a higher GDP can allocate more resources, better training, and appropriate infrastructure to their participants. This can improve a participant's performance and so their chances of winning medals. We found a considerable association between a country's GDP and the number of medals it has won at the Olympics: nations with a higher GDP tend to have more medals too. The line graph above shows the GDP of 20 countries. The values vary widely, as the economic status of every country is different. The bar graph shows the number of medals won by each country. For most countries, the number of medals won tends to rise with GDP. The USA, the country with the highest GDP, also has the highest number of medals. There are cases where a country has won a considerable number of medals even though its GDP is not very high. For example, Germany has a huge number of winners, but that value is not proportionate to its GDP. This could be because, despite its lower GDP, Germany has many athletes participating in every edition, which increases its probability of winning.
Comparing India and Canada, India's GDP is much higher than Canada's, yet Canada has more medals than India. Canada is a highly developed nation with one of the largest economies in the world, whereas India is still a developing nation. This could be a reason why some countries in the above graph are outliers.
So there is an apparent connection between a country's GDP and its performance at the Olympics. But this relationship is not fixed, as many other factors play a major role in determining a nation's performance.
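One way to put a number on the GDP and medals relationship is a correlation coefficient. The sketch below computes a Pearson correlation with a small hand-written helper (`pearson` is an illustrative name, not part of the notebook) on four rows taken from the table above: the United States, China, Japan, and the United Kingdom. The Germany row is left out because its NOC appears mismatched. With only four points this is an illustration of the calculation, not evidence for the claim.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# GDP in billions and medal counts for USA, CHN, JPN, GBR from the table above.
gdp = [22939.58, 16862.98, 5103.11, 3108.42]
medals = [5637, 989, 913, 2068]

r = pearson(gdp, medals)
print(round(r, 2))
```

On the full merged dataframe, the same figure could be obtained with the pandas `corr` method on the two columns. A rank-based measure may be more suitable, since GDP values are heavily skewed.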
To find the success rate of participants in each year, we select the USA, as it has the highest number of medals and participants, giving us better data to analyze.
#filtering NOC as USA and getting the medal count and number of participants
number_of_medals=data[data['NOC']=='USA'].groupby('Year')['Medal'].count().to_frame()
number_of_participants=data[data['NOC']=='USA'].groupby('Year')['index'].count().to_frame().rename(columns={'index':'Participants'})
#merging the medals and participants into one data frame and calculating the success rate based on that
participants_medal_ratio=pd.merge(number_of_participants,number_of_medals,left_index=True,right_index=True,how='left')
participants_medal_ratio['Winning ratio']=participants_medal_ratio['Medal']/participants_medal_ratio['Participants']
participants_medal_ratio.reset_index(inplace=True)
participants_medal_ratio.head()
| | Year | Participants | Medal | Winning ratio |
|---|---|---|---|---|
| 0 | 1896 | 27 | 20 | 0.740741 |
| 1 | 1900 | 135 | 63 | 0.466667 |
| 2 | 1904 | 1109 | 394 | 0.355275 |
| 3 | 1906 | 81 | 24 | 0.296296 |
| 4 | 1908 | 219 | 65 | 0.296804 |
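The Participants column above counts rows, and each row is one athlete in one event, so an athlete entered in several events is counted several times. The sketch below, on made-up rows, shows how the ratio changes when unique athlete IDs are counted instead. The column names `entries`, `athletes`, and `medals` are illustrative choices.

```python
import pandas as pd

# Made-up rows: athlete 1 enters three events in 1904 and wins two medals.
rows_df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3],
    "Year": [1904, 1904, 1904, 1904, 1908],
    "Medal": ["Gold", "Silver", None, None, "Bronze"],
})

per_year = rows_df.groupby("Year").agg(
    entries=("ID", "size"),      # athlete-event rows, as counted above
    athletes=("ID", "nunique"),  # distinct athletes
    medals=("Medal", "count"),   # non-missing medals
)
per_year["medals_per_entry"] = per_year["medals"] / per_year["entries"]
per_year["medals_per_athlete"] = per_year["medals"] / per_year["athletes"]
print(per_year)
```

In this toy example the 1904 ratio is 0.5 per entry but 1.0 per athlete, so the choice of denominator can change the picture considerably.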
fig, ax1 = plt.subplots(figsize=(15,7))
#line graph represents the winning ratio for each year for USA
sns.lineplot(data=participants_medal_ratio['Winning ratio'], marker='o', sort = False, ax=ax1,color='black')
ax1.set_ylabel(ax1.get_ylabel(),size=12)
ax1.set_xlabel('Year',size=12)
#rotating the tick labels through tick_params avoids the FixedFormatter warning
ax1.tick_params(axis='x', labelrotation=45)
#bar graph represents the total number of participants for each year for USA
ax2 = ax1.twinx()
plt_ax2=sns.barplot(data=participants_medal_ratio,y='Participants',x='Year', alpha=0.5, ax=ax2,color='brown')
plt_ax2.set_title('Number of Participants and winning ratio for USA', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.show()
As the number of participants increases, the number of medals increases too, but the success rate of the participants is affected. From the above graph we can see that, most of the time, when the number of participants is high, as in 1904, 1988, and 1992, the success rate decreases. When the number of participants is low or moderate, the success ratio is high, as in 2008, 2010, 2012, and 2016. A possible explanation is that with fewer participants it is easier to focus on each of them and give everyone better facilities. This helps participants feel more valued and gives them more motivation to perform better, improving their chance of winning a medal.
Comparing, for each sport, the BMI of participants who won a medal with the BMI of participants who did not.
#getting the mean BMI for each sport for participants who did not win a medal
bmi_sport_loser=data[data['Medal'].isnull()].groupby(['Sport'])['BMI'].mean().to_frame()
bmi_sport_loser.rename(columns={'BMI':'BMI of loser'},inplace=True)
#getting the mean BMI for each sport for participants who won a medal
bmi_sport_win=data[data['Medal'].notnull()].groupby(['Sport'])['BMI'].mean().to_frame()
bmi_sport_win.rename(columns={'BMI':'BMI of winner'},inplace=True)
#merging the above two data frame into one with index as sport
bmi_sport=pd.merge(bmi_sport_loser,bmi_sport_win,right_index=True,left_index=True)
#plotting a heat map of bmi for each sport by creating a pivot table
sport_wise= pd.pivot_table(bmi_sport, values =[ 'BMI of winner', 'BMI of loser'],index= "Sport", aggfunc= 'mean').sort_values(by='BMI of winner', ascending= False)
sport_wise.style.background_gradient(cmap='plasma')
| Sport | BMI of loser | BMI of winner |
|---|---|---|
| Tug-Of-War | 28.841756 | 28.660159 |
| Weightlifting | 27.369212 | 28.095725 |
| Bobsleigh | 27.121967 | 27.248596 |
| Judo | 25.527043 | 26.057466 |
| Rugby Sevens | 25.312075 | 25.869383 |
| Baseball | 25.635023 | 25.742927 |
| Wrestling | 25.116385 | 25.341874 |
| Ice Hockey | 25.206615 | 25.244001 |
| Shooting | 24.683585 | 24.955228 |
| Water Polo | 24.752512 | 24.930969 |
| Luge | 24.652977 | 24.850632 |
| Golf | 23.920372 | 24.580007 |
| Art Competitions | 24.480427 | 24.530342 |
| Polo | 24.506295 | 24.508394 |
| Racquets | 24.367252 | 24.367252 |
| Curling | 23.514347 | 24.180837 |
| Canoeing | 23.878523 | 24.126092 |
| Sailing | 23.869619 | 24.100952 |
| Handball | 24.115066 | 23.974117 |
| Alpine Skiing | 23.869115 | 23.943912 |
| Softball | 23.256626 | 23.824196 |
| Jeu De Paume | 23.692962 | 23.689476 |
| Military Ski Patrol | 23.686822 | 23.686822 |
| Roque | 23.686822 | 23.686822 |
| Croquet | 22.378033 | 23.686822 |
| Archery | 23.235688 | 23.655493 |
| Rowing | 23.524011 | 23.610648 |
| Motorboating | 23.281938 | 23.503556 |
| Basketball | 23.551886 | 23.395850 |
| Skeleton | 24.007858 | 23.276494 |
| Speed Skating | 23.125298 | 23.195008 |
| Cycling | 22.568059 | 23.036426 |
| Football | 22.947523 | 23.002465 |
| Hockey | 23.033785 | 22.971696 |
| Snowboarding | 23.085891 | 22.959145 |
| Freestyle Skiing | 22.773376 | 22.947579 |
| Fencing | 22.867228 | 22.942876 |
| Beach Volleyball | 22.680931 | 22.705498 |
| Volleyball | 22.375129 | 22.692176 |
| Athletics | 22.148774 | 22.476090 |
| Equestrianism | 22.376885 | 22.304784 |
| Swimming | 22.048272 | 22.229632 |
| Modern Pentathlon | 22.242879 | 22.229325 |
| Tennis | 22.166483 | 22.213226 |
| Badminton | 22.385184 | 22.203629 |
| Short Track Speed Skating | 22.096522 | 22.200672 |
| Biathlon | 21.907384 | 21.905501 |
| Boxing | 21.691332 | 21.871238 |
| Cross Country Skiing | 21.946201 | 21.858063 |
| Table Tennis | 22.075338 | 21.739265 |
| Gymnastics | 21.507387 | 21.477622 |
| Diving | 21.909241 | 21.467918 |
| Taekwondo | 21.664081 | 21.365381 |
| Trampolining | 21.339424 | 21.228316 |
| Nordic Combined | 21.520404 | 21.119267 |
| Figure Skating | 20.820736 | 20.954077 |
| Triathlon | 20.396695 | 20.319454 |
| Ski Jumping | 20.990095 | 20.177024 |
| Synchronized Swimming | 19.647038 | 19.769782 |
| Rhythmic Gymnastics | 17.371665 | 16.885826 |
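To see at a glance where medal winners are heavier or lighter than non-winners, a difference column can be added and sorted. The sketch below does this for three rows copied from the table above; the column name `winner minus loser` is an illustrative choice.

```python
import pandas as pd

# Three rows copied from the table above.
bmi_sample = pd.DataFrame(
    {
        "BMI of loser": [27.369212, 20.990095, 17.371665],
        "BMI of winner": [28.095725, 20.177024, 16.885826],
    },
    index=pd.Index(
        ["Weightlifting", "Ski Jumping", "Rhythmic Gymnastics"], name="Sport"
    ),
)

# Positive values mean medal winners have the higher mean BMI.
bmi_sample["winner minus loser"] = (
    bmi_sample["BMI of winner"] - bmi_sample["BMI of loser"]
)
ranked = bmi_sample.sort_values("winner minus loser", ascending=False)
print(ranked)
```

Applied to the full `bmi_sport` frame, the same two lines would rank every sport by the gap between winners and non-winners.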
While BMI does not measure body fat directly, it correlates fairly closely with direct measures of body fat, so it can serve as an alternative to them. BMI is the ratio of a person's weight to the square of their height. The heat map shows that for the majority of sports, the BMI of medal winners is slightly higher than the BMI of non-winners. We can also see that for some sports, such as Rhythmic Gymnastics, the BMI of winners is lower than the BMI of non-winners.
In sports that require more physical strength, such as Weightlifting and Rugby, athletes need to be heavy. Since a heavier athlete has an advantage over a lighter one, the BMI is higher in these sports. In sports that require more flexibility and grace, such as Gymnastics, athletes need to be lighter, so the BMI is low. There are some exceptions, such as Tug of War, where the athletes' weight seems to be a significant factor in deciding the winning team, yet winners do not have the higher BMI. Strategy is equally important in deciding a winner.
Our objective here was to compare the number of athletes participating in the Olympics over the years. As the Olympics gained popularity and many countries gained independence, we expected a linear or exponential increase in the number of participating athletes. However, that was not the case, so we analyze the trend and its causes further. Comparing the number of medals with the number of participating athletes can tell us whether a drop in participants is due to a sport no longer being recognised by the Olympics or to some other reason.
#Creating a dataframe with the total number of participating athletes for both the Summer and Winter Olympics
tmp = data.groupby(['Year'])['Season'].value_counts()
dfv = pd.DataFrame(data={'Athletes': tmp.values}, index=tmp.index).reset_index()
#Creating a dataframe with the data of all the medal winners
data_winners = data[data.Medal.notnull()]
#Dataframe with all the medal winners of the Summer Olympics
data_summer_winners = data_winners[data_winners['Season']=='Summer']
#Calculating number of medal winners per edition of Summer Olympics
tmp_sum =data_summer_winners.groupby(['Year'])['Medal'].count().to_frame().sort_values(by= 'Year')
tmp_sum.reset_index(inplace=True)
#Creating a dataframe with the data of the Summer Olympics
dfS = dfv[dfv['Season']=='Summer']
#Creating the trend line of the number of athletes in the Summer Olympics
traceS = go.Scatter(
    x = dfS['Year'],y = dfS['Athletes'],
    name='No. of Athletes in Olympics',
    marker=dict(color='Red'),
    mode = 'markers+lines'
)
#creating the trend line of the number of medals won by the athletes in each edition of the Summer Olympics
traceM = go.Scatter(
    x = tmp_sum['Year'],y = tmp_sum['Medal'],
    name='No. of Medals',
    marker=dict(color='Blue'),
    mode = 'markers+lines'
)
#Plotting both the above trend lines
data1 = [traceS, traceM]
layout = dict(title = 'Effects of Politics on Olympics',
              xaxis = dict(title = 'Year', showticklabels=True),
              yaxis = dict(title = 'Number of Athletes/Medals'),
              hovermode = 'closest'
             )
fig = dict(data=data1, layout=layout)
iplot(fig, filename='events-athlets2')
From the line chart, we can see that there was no Olympics in the year 1916. This is because of World War 1. Similarly, the 1940 and 1944 Olympics were cancelled due to World War 2.
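The cancelled editions can also be found programmatically by comparing the years present in the data with the regular four-year schedule. In the sketch below the list of Summer years is written out by hand as an assumption; in the notebook it would presumably come from something like `sorted(dfS['Year'].unique())`. The list includes 1906, which appears in the tables above.

```python
# Summer editions assumed to be present in the dataset (written out by hand).
summer_years = [
    1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932, 1936,
    1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988,
    1992, 1996, 2000, 2004, 2008, 2012, 2016,
]

# The regular schedule: every four years from 1896 to 2016.
expected = range(1896, 2017, 4)

# Years on the schedule with no Games in the data.
held = set(summer_years)
missing = [year for year in expected if year not in held]
print(missing)  # [1916, 1940, 1944]
```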
After World War 2, there was no major conflict that would lead to cancellation of the Olympics. However, due to political reasons, there were boycotts of the Olympics by countries. In total, there have been 6 instances when a few countries decided not to participate because of political reasons.
The Olympics boycotted were 1956, 1964, 1976, 1980, 1984, 1988. The reasons are as follows:
Our overall goal was to analyze the data and create visualizations to explore the key factors that affect countries’ performance in the Olympic Games. We aimed to analyze which factors were most important to winning, and find any trends that may allow us to predict which countries will perform best in the future Olympic Games. While performing our analysis, we also found that there were large gaps between some years that can be attributed to external political effects.
Based on our analysis, we found there are several key indicators that lead to strong performance in the Olympic Games such as: